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We carry out a theoretical study of the vibrational and relaxation properties of naturally-occurring 
proteins with the purpose of characterizing both the folding and equilibrium thermodynamics. By 
means of a suitable model we provide a full characterization of the spectrum and eigenmodes of 
vibration at various temperatures by merely exploiting the knowledge of the protein native structure. 
It is shown that the rate at which perturbations decay at the folding transition correlates well 
with experimental folding rates. This validation is carried out on a list of about 30 two-state 
folders. Furthermore, the qualitative analysis of residues mean square displacements (shown to 
accurately reproduce crystallographic data) provides a reliable and statistically accurate method 
to identify crucial folding sites/contacts. This novel strategy is validated against clinical data for 
HIV-1 Protease. Finally, we compare the spectra and eigenmodes of vibration of natural proteins 
against randomly-generated compact structures and regular random graphs. The comparison reveals 
a distinctive enhanced flexibility of natural structures accompanied by slow relaxation times at the 
folding temperature. The fact that these properties are intimately connected to the presence and 
assembly of secondary motifs hints at the special criteria adopted by evolution in the selection of 
viable folds. 



INTRODUCTION 



The backbone structure of globular proteins displays a notable degree of order and organization resulting in sec- 
ondary motifs such as a— helices and /3— sheets and in their optimal arrangement into a compact shape. The special 
neatness and regularity of native conformations impacts on, or rather reflects, several features that are distinctive of 
naturally occurring proteins 0, |[ ||, Hj. Indeed, experimental and theoretical studies have shown that native structures 
conceal a wealth of information about aspects as diverse as native elastic properties, folding rates and key folding stages 
IHj Hj 01 i|- A promising graph theoretical approach 10| can be used to ifentify rigid and flexible regions in proteins, 
and to investigate the emergence of such flexible regions in unfolding processes. Recent topology-based approaches, 
which are independent of sequence specificity, have revealed that a schematic, coarse-grained structural description is 
sufficient to obtain results in remarkable qualitative and even quantitative agreement with known experimental facts. 

For example, concerning near-native vibrations, one is entitled to replace the detailed interactions among amino 
acids which are in contact in the native state with springs acting on effective centroids (usually the Ca atoms) pT| . 
Topological folding models, on the other hand, try to characterize how the loss of configurational entropy contrasts 
the progressive establishment of native interactions in a folding process ^, |l^, |l^ . 

In the present study, we focus on a topology-based model that incorporates both the aspects mentioned above. It 
consists of a beads- and-springs model but, at variance with other approaches, the strength of the effective interaction 
depends on the temperature of the thermal bath [ p8| . Since the model is amenable to analytic treatment, it is 
possible to characterize rigorously both the thermodynamics as well as the vibrational/relaxation dynamics at various 
equilibrium stages of the folding process (not only near the native state) . 

The scope of the present study is two-fold. First we examine how the organization of native contacts impacts on 
fundamental properties of proteins and, hence, to what extent the knowledge of the contact map can be exploited to 
predict experimentally verifiable quantities. Our second goal is to repeat the same analysis in the context of generic 
disordered globular structures. From the comparison of the outcomes we aim at finding clues about the criteria 
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adopted by nature in the selection of viable folds. 

The vibrational spectrum at finite temperature allows to determine typical relaxation times which, in turn, correlate 
with high accuracy with experimental folding rates. This is rather surprising and unexpected since the relaxation 
time refers to near equilibrium situations whereas the folding process is far from equilibrium. 

Furthermore, the comparative study of analogous static and vibrational properties of disordered globular structures, 
or otherwise random graphs, and proteins shows that the latter have very uncommon properties in terms of flexibility 
and mechanical relaxation times that can be exploited to predict not only folding rates but also the key sites of an 
enzyme and amino acid vibrational amplitudes in partially folded states. These distinctive mechanical properties 
reflect the hierarchical assembly of protein native states and, in particular, can be traced back to the presence 
and organization of secondary elements. Hence, they add to the increasing evidence that the emergence of such motifs 
was promoted by special criteria operated by nature to select viable folds |||, |2^, |2^, |2^, . 

The straightforward implementation of concepts presented here, makes the model under consideration a useful tool 
to characterize the folding properties of a protein, complementing alternative theoretical strategies or more labor- 
intensive experimental techniques. 

THEORY 

The model under consideration incorporates two features that previous theoretical investigations have proved suc- 
cessful for protein modeling. In spirit, it belongs to the class of Go- models jl2), since the energy scoring function 
introduces a bias in structure space favoring the formation of native contacts. This ensures that the native structure 
under examination has minimum effective energy. Furthermore, we make use of the observation that, near equilibrium, 
the dynamics resulting from complicated atomic interactions can be well-reproduced by simple harmonic potentials 
[24I ^ at least for time intervals shorter that 1 ns |Q . The starting point of our analysis is the following effective 
harmonic Hamiltonian |]l8| (for its derivation, see Appendix ): 

i/ = ^riL,,rj+/(r) , (1) 

hi 

where ri denotes the distance-vector of site i from its position in the native structure, and / is a term that only 
depends on the equilibrium temperature. The entries of the symmetric matrix, L, incorporate the elastic forces 
that act on each residue to restore the native separation with its nearest neighbors in sequence (effect of the peptide 
bonding) and with other amino acids in interaction in the native state. The peptide bond is modeled by springs whose 
associated Boltzmann weight is independent of temperature. This reflects the little change of the peptide coupling 
over the range of temperatures where unfolding/refolding transitions are studied. In addition to this contribution, at 
variance with previous approaches, the energy scoring-function includes temperature-dependent interactions between 
the pairs of residues contacting in the native state. The list of such interactions is summarized in the contact map, 
A, whose entries. Ay, are 1 if residues i and j are in contact in the native state (i.e. their Ca separation is below 
the cutoff c = 7.5 A) and otherwise. For simplicity we refer to these temperature-dependent interactions as non- 
covalent, although they also act on consecutive residues. The strength of such non-covalent springs is calculated 
self-consistently according to the method described in Appendix . This makes the model non-linear but still very 
tractable. When the temperature of the model is zero (this mathematically-convenient idealization corresponds to 
a physiological temperature where the protein is stable in its native state), all springs have the same strength. As 
temperature is switched on, this strength is reduced by a deterministic amount that is larger for springs where the 
largest stretching is observed (see Appendix ). 

RESULTS AND DISCUSSION 
Vibrational properties of proteins and disordered structures 

At zero temperature, when the non-covalent springs are equally strong and dominate in number over the covalent 
ones, the model is equivalent to the Gaussian Network Model (GNM) of Bahar et al. |^, ^ which has been 
widely used to characterize the mobile regions of the native state from the analysis of the normal modes of the 
structure. These represent the natural oscillations (independent of each other) of a protein and convey information 
about which regions are more flexible, and how thermal energy is dissipated in order to restore equilibrium (i.e. the 



3 



native structure). Although atoms in proteins are as tightly packed as in crystalline solids, the vibrational properties 
of native structures are very different Q. A notable example is given by the density of eigenvalues of the L matrix, 
that is the histogram of vibrational frequencies. This is shown in Fig. ^ where we concentrated on two examples: 
an individual enzyme, the monomer of the HIV-1 protease, and the average histogram calculated for several proteins 
known to fold via the simplest possible mechanism (two-state) |^0| . The peculiarities of the protein spectrum can be 
ascribed to several distinctive protein properties such as the varying degree of site connectivity (burial profile) or the 
neat hierarchical organization of the native state in terms of secondary motifs. 

To identify how the vibrational spectrum is affected by these general features we shall not take the crystalline solids 
as a reference, but consider several families of disordered contact maps. For example, to isolate the role played by 
the burial profile from other features we first consider disordered matrices obtained by taking the contact map of a 
real protein and randomly reshuffling its entries yet preserving the symmetry of the matrix, the burial profile and the 
contacts between consecutive sites. Successively we shall consider the case of completely disordered maps but with 
equal connectivity (burial) for each site. It is important to stress that in these two cases of disorder, the A matrices will 
not, in general, be physically viable. In fact, it is not guaranteed that there exists a viable three-dimensional backbone 
associated to an arbitrary symmetric contact map. For this reason, we complete our analysis by considering the case 
of contact matrices of three-dimensional compact self-avoiding structures generated randomly with a computer with 
the constraint to be as compact as naturally-occurring proteins. This case will provide a useful term of comparison for 
identifying the spectral properties resulting from the neatly organized three-dimensional shape of naturally occurring 
protein conformations. 

We begin by considering the first instance of disorder, i.e. random reshufHings that preserve both the symmetry of 
A and also the protein native burial profile, i.e. the native number of contacts to which each site takes part The 
reshufflings consist in picking randomly a pair of non-zero distinct entries, A^j and A^.j and checking if the entries A^; 
and Afcj are zero. If this is so, the old pair of entries (and their symmetric counterparts) are set to zero and the new 
ones to 1. The preservation of the connectivity of each site allows to study the impact of disorder on the vibrational 
frequency spectrum while keeping burial profiles unaltered. As anticipated above, such randomized contact maps 
are not necessarily the counterpart of any feasible three-dimensional structure; in fact, a more appropriate view of 
such matrices is within the framework of graphs |3l[ |3^, (the nodes and links corresponding to beads and springs, 
respectively) . 

A very important difference shown in Fig. |l| is that proteins have a larger number of low-frequency modes of 
vibration compared to the spectrum of the reshuffled maps. This difference is made more dramatic by considering the 
spectrum of loopless regular graphs. The term regular here refers to the fact that all sites have the same connectivity 
(burial), k, but the entries of the associated symmetric matrix. A, are otherwise random p3| . 

This limit case of disorder is important because the average spectrum of such ensemble of matrices has been 
calculated exactly by McKay pij . The density of eigenfrequencies is given by: 



I. fc^-L-fcl^ for - fc| < 2VF 



y otherwise, 

and its distribution is shown in Fig. |l| for the case fc = 8, corresponding to the average site connectivity (including 
peptide bonding) in real proteins (when the interaction cutoff is around 7.5 A). 

It appears that the range of vibrational frequencies is severely limited compared to that of real proteins; in fact, 
the spectrum of such graphs reproduces only the central part of the frequency histogram of proteins. The outlying 
tails at both low and high frequency are not captured at all. Previous studies have pointed out how such tails reflect 
the existence of heterogeneity in site connectivity (burial) [^6| This fact, and its implications for the mechanical 
properties of natural biopolymers are examined in detail in the next section. 



Localization properties 
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In a variety of contexts, from the physics of disordered systems to graph theory, spectra have always attracted 
considerable attention because of the special localization properties of the associated modes. High frequency normal 
modes are usually concentrated on sites/regions with high connectivity ||2j, ^ 



, |39[|. This is intuitive 

since a larger number of connections leads to an enhanced local stiffness and hence a higher frequency of vibration. 
Conversely, low frequency modes are centered on sites/regions that, having fewer connections than average, are more 
flexible This overall picture applies remarkably well to the reshuffled contact maps, as visible in Fig. ||. The 
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FIG. 1: (top) Density of eigenmodes for (a) HIV-1 Protease (heavy line), (b) average over 3000 conservative reshufBements of 
HIV-1 PR contact map (light line). The dotted line is the exact density of eigenmodes for a regular graph with k = 8 (see 
text), (bottom) Average density of eigenmodes for the two-state folders listed in the caption of figure |5| (heavy line) and for 
100 contact map reshufBements for each of them (light line). Reported data refers to near- native vibrations, corresponding 
to r = in our model. Previous theoretical studies, carried out with full atomic detail, have shown that the peak of the 
eigenmode distribution falls around a frequency of 45 cm~^ |35| . 

plots show that the modes at low [high] frequency are very localized on sites with low [high] connectivity (burial). 
The degree of localization of a normahzed eigenmode of vibration, {v} is defined as J2j bjl^ (see eg. [Q). This 
measure will take on the largest value, 1, if all entries of {v} but one are zero (full localization). On the contrary, the 
localization measure will be minimum when all the vector components are equal in modulus (full delocalization). 

It is now interesting to turn to the case of proteins. Fig. ^ shows that high-frequency modes are, as before, highly 
localized on heavily buried sites. Notice that the highly localized states are not limited to the upper edge of the 
spectrum. The picture, however, changes for low-frequency modes, which appear to be significantly more delocalized 
than for the reshuffled case. This denotes a degree of flexibility that is significantly larger than randomly connected 
graphs (yet with the same burial profile!). This is not a peculiar feature of the HIV-1 Protease monomer, but is more 
general, as illustrated in the right panel of Fig. ^ displaying averages taken over the two-state folders listed in the 
caption of Figure ^ and their reshufflements. The averages in Fig. |^ are reported only for the 40 slowest modes, which 
cover a similar range of frequencies across the two-state folders. The inhomogeneity of the lengths of such proteins 
prevents from taking straightforward averages over the whole frequency range. 

Several previous investigations of normal modes and spectra of proteins |5[ ^ 



furthermore it has been argued that the slowly moving regions are the natural candidates for biological functions 
involving structural rearrangements. It is important, however, to note that also the burial-preserving reshuffled maps 
have low-frequency vibrations. 

The novel message of Fig. ^ is that the slow modes of random graphs are nearly fully localized (as high frequency 
modes), while in proteins one encounters the slow movement of regions spanning several residues [p^ . This special 
property, which does not depend on the burial profile, must result from the special organization of native contacts which 
provides a remarkable degree of flexibility. This, in turn, is associated with a rate of thermal energy dissipation that 
is slower compared to disordered graphs. The natural measure for the limiting rate at which mechanical excitations 
decay is, in fact, given by the smallest frequency (eigenvalue) of L. Equivalently, the slowest vibrational relaxation 
(decay) time in the system, which we shall denote as Aq, is given by the inverse of the smallest vibrational frequency. 



The immediate cause for this further difference is simply described in terms of graph properties pn| . There is a 
well-defined relationship between the slowest relaxation time and the average contact-diameter (meaning the average 




Ao = l/wo- 
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Eigenmode Rank (HIV-1 PR) Eigenmode Rank (2 state folders) 

FIG. 2: (top left) Degree of burial (number of native contacts) of sites that are at the center of the various eigenmodes. 
The eigenmodes are ranked according to increasing vibrational frequency. The center of an eigenmode is the site where one 
encounters the largest component (in absolute value). (bottom left) Degree of localization of the eigenmodes. The smooth solid 
lines pertain to averages over 3000 reshufflings of the protease (monomeric) contact map (but preserving the site connectivity) . 
The dotted line relates to the native state, (right) Degree of localization of the first 40 eigenmodes averaged over the two-state- 
folders (dotted line) and over 100 reshufHements for each of them (solid line). 

number of native contacts, including those between consecutive sites, that have to be traversed to go from an arbitrary 
site to another). This relation is neatly visible in Fig. which displays results for the distinct two-state folders and 
an equal number of their reshufHements. 

It is important to see that, among all proteins, the ones with many local contacts (all-a family) tend to be at 
one end of the scatter plot, while proteins belonging to the all-/? family are closer to the case of reshuffled contact 
maps. The relaxation of real proteins appears to be also slower than randomly-collapsed three-dimensional structures. 
Within our simplified approach this difference is most visible at finite temperatures, as illustrated in Fig. ^, where 
the relaxation rates are computed at T = 2Tp. 

The slowest relaxation encountered in proteins can be understood from the fact that the establishment of a few 
random contacts, very rapidly makes the structure rigid, and less susceptible to further rearrangements. The hierar- 
chical organization of contacts in proteins, in terms of secondary motifs further arranged in the tertiary structure, is 
responsible for the enhanced flexibility visible in Fig. |^. In fact, the relaxation times of isolated secondary motifs, both 
a and /?, of equal length is nearly the same and, again, much higher than a corresponding random reshuffling. This 
is illustrated in Fig. ^. From these facts it is tempting to speculate that a- and /3-like motifs are among the configu- 
rations that, for a given burial profile, have the largest relaxation times. This is particularly intuitive in the case of 
helices since their periodicity (modulus boundary effects) is immediately associated to the presence of slowly-decaying 
excitations. 

Folding rates 

In this subsection we shall explore the intimate connection between the relaxation time of slow eigenmodes of 
protein structures in thermal equilibrium and the protein folding rate. It is tempting, and physically appealing, to 
speculate that the relaxation rate of partially folded states may be an indicator of the folding velocity. This stems 
from the observation that low-frequency vibrations involve extended protein segments and require less energy to be 
activated. Hence, structural rearrangements from the unfolded state to the native one should occur more easily when 
the relaxation times of partially folded states are higher. This qualitative argument is also supported by a different 
reasoning where the folding of a protein is viewed as a diffusion in the complicated free-energy landscape until the 
native-state (at the minimum of the accessible free energy) is reached. Different native shapes will then be associated 
with landscapes that, despite being biased towards the native conformation, may have significant differences in terms 
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FIG. 3: (left) Scatter plot of the slowest native relaxation times (T = 0) versus the average native contact diameter. The 
open symbols denote data pertaining to the two-state folders listed in the caption of Fig. H. The open triangles, square and 
circles denote proteins belonging to the a, /3 and ajd families respectively. The filled circles denote the slowest relaxation time 
observed with a random (but burial-preserving) reshufflement of the native contact map of each of the two-state folders, (right) 
Scatter plot of the slowest relaxation times at T = 2 Tf versus the average native contact diameter. The data pertain to the 
two-state folders and an equal number of computer-generated disordered compact structures with same length distribution 
(filled squares). 



of numbers of metastable minima, height of the barriers between them etc. The corrugation of the landscape will 
undoubtedly affect the folding rate. The relaxation time of a structure in thermal equilibrium reflects the average 
ruggedness of the landscape. In a smoother landscape a protein will be able to make larger conformational changes 
for the same amount of dissipated energy. We can thus conclude that the native state is reached with greater difficulty 
if, during the folding process, the partially-formed structures have smaller equilibrium relaxation times. 

In agreement with this picture, we have found a notable interdependence (see Fig. ^ between the experimental 
folding rates, Kp, and the slowest relaxation rate, (in dimensionless units) which dominates the long-time relaxation 
kinetics in equilibrium. At temperatures much larger than the folding temperature, Tp (identified as the temperature 
at which the peak of the specific heat is observed), the relaxation time, Aq = l/^o conveys little information about 
the folding rate, since virtually no contacts are formed; this is true also at low T, where it measures only the rate of 
thermal dissipation of the fully-folded native structure. 

The highest correlation is observed at finite temperatures higher than Tp (see Fig. I). Although we believe that the 
precise location of the peak of the correlation may be sensitive to the model details, this result is plausible since it is 
above the folding transition that the significant search in conformation space occurs. The correlation coefficient in the 
neighborhood of the optimal temperature is r = 0.73. It is possible to assess its statistical significance by comparison 
with the null case of no correlation between relaxation rates and folding times. In the absence of any pair correlation 
between variables with converging moments, one expects that the correlation coefficient measured over a finite sample 
is normally distributed. This allows to calculate exactly the probability to encounter a correlation coefficient higher 
than r — 0.73 if relaxation and folding times were statistically independent. This probability turns out to be equal to 
2 • 10~^ which testifies that the observed dependence between the two quantities is truly significant. 

These results add to previous studies where folding rates were predicted with analogous level of significance on the 
basis of contact locality ||^ or different topologic indicators (such as the clustering coefficient or cliquishness) based 
on a graph-theoretical description of a protein's non-bonded interactions [p2[. 
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FIG. 4: Frequency spectrum at T = of (a) helix 11-31 from protein lc3w and (b) the beta sheet 85-105 from protein laac 
(continuous hue) . The dots denote the frequency spectrum calculated over 100 burial-preserving random reshufflements of the 
original contact map. The inset displays the histogram of the slowest relaxation times, Ao for the reshuffled maps. The slowest 
relaxation time for the original helix and beta sheet are denoted by an arrow in the respective figures. 
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FIG. 5: Linear correlation coefficient between the logarithm of the theoretical relaxation time, log Ao, and experimental folding 
rates for a set of proteins known to fold via a two-state mechanism. The inset shows the scatter plot of the actual theoretical 
and experimental values at a temperature close to where the best correlation, r = 0.73 is observed. Notice that Kf spans 
several orders of magnitude. The folding rates were desumed from the experimental studies collected in |^ |3^, ^ . 

The set contained the proteins listed below, a family: limb, 2abd, limq, lycc, Ihrc; oi/ P family: 2gbl, ldiv.n,2ptl, Icoa, Ihdn, 
ldiv.c,lurn, laps, Ifkb, 2vik; /3 family: lshg,lsrl,lshf.a, ltud,lcsp,3mef,2ait,lpks,lten,lwit,lfnf (9FN-III and lOFN-III). When 
multiple experimental entries were available for the folding rate of a give protein, we used considered the arithmetic average of 
logiKp). 



B factors 



The choice of a simple Gaussian Network Model to investigate the biological function of proteins has been motivated 
by a striking qualitative agreement between the experimental B factors (or temperature factors) measured in X-ray 
diffraction experiments and the mean square residue displacement calculated theoretically from the expression pq, 
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FIG. 6: Correlation between the theoretical B factors and the experimental ones reported in 30 X-ray resolved proteins (filled 
bars) and NMR resolved ones (open bars). The proteins in the two classes were ordered according to increasing lengths (the 
overall length range is 39-258). The PDB code of the proteins adopted here are reported below. X-ray (1 to 30):lshg, Ifxd, 
ligd, fail, Ihoe, Ivcc, Icei, lopd, Ifna, Ipdr, Ibeo, Ipoa, Imai, Ibfg, llaa, Ikuh, Icof, llcl, Ipkp, Ivsd, Inpk, Ivhh, Igpr, Isfe, 
lamm, lido, Ikid, Icex, Ichd, Ithv; NMR (31 to 56): laqr, laa3, likl, IboO, lawj, laSc, lalw, Ibhu, locd, Ight, Itrs, Ijli, 
Ifwq, Itam, lahk, Itbd, Igdf, Ijoo, Ifls, Ibuy, lil6, laa9, lak6, la23, 11x1, leza. 



k 

where i is the residue index, vf is the i-th component of the A;-th mode whose eigenfrequency is Wk- The sum in 
eqn. (P ) is over k such that uj^ ^ 0. Several alternative methods exist to predict (or refine) experimental B factors 
p^ , |35|, |57|, ^ |5^. The appealing feature of this Gaussian approach is that it is possible to give a good account of the 
experimental B factors by exploiting only the connectivity information contained in the native structure. A recent 
work by Halle has remarked the heavy influence that the native-state topology exerts on the measured i3-factors; 
in particular, it was shown that the mobility of an amino acid anti-correlates with its degree of burial (estimated as 
the number of non-covalent bonds to which it takes part). Besides the fundamental effect of the burial profile, the 
GNM allows to model the influece on B-factors of other important topogical features, such as the locality of contacts 
and presence of highly-interconnected clusters of interactions. In addition, the decomposition of dynamical motion 
into independent modes allows to trace which modes of vibrations are most responsible for the mobility of a given 
site or protein region, thus providing useful hints about the protein biological function. 

In this section we perform an analysis of the correlation between the theoretical and experimental B factors, aimed 
at investigating the role played by the single vibrational eigenmodes in the overall observed temperature factors. 

We stress that in our model, both eigenfrequencies and eigenmodes are temperature-dependent. We begin our 
analysis by considering the T ^ limit of our model, that corresponds to the GNM studied in refs. Q. 

We analyzed 30 protein structures, all obtained in X-ray diffraction experiments, of different size (up to about 200 
amino acids). The proteins, listed in the caption of Figjq, are chosen among the monomeric representatives of distinct 
structural classes obtained from high-resolution data |6^. Thermal fluctuations for each amino acid were compared 
with the corresponding temperature factor reported for the CA atoms in the PDB file |6^. Although a remarkable 
agreement with experimental data can be obtained for several proteins (correlation coefficients larger than 0.75 for 
proteins with nearly 200 residues), in some rare instances the correlation was around 0.4 (see Fig. H). It must be 
borne in mind, however, that such correlations are never trivial, since they are measured over the entire protein length. 
Assuming, as before, that in case of no correlation the distribution of r is normal, we can calculate the probability 
to observe higher correlations than the measured ones. Even in the worst case, this probability is never larger than 
t = 3 10~^ which is an indicator of excellent correlation. This estimate of statistical significance is so small that even 
if more precise calculations (e.g. accounting for the skewness in the B-factors distributions) modify the estimate of t 
by two orders of magnitude, the correlation significance would be high. 

Although statistically relevant, the agreement with temperature factors does not have the same quality for all crystal 
structures. It is important to remark that the crystallographic _B-factors are not obtained by direct measurement. 
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FIG. 7: (a) Correlation coefficients between single eigenmodes amplitudes (only one term at a time is considered in eqn. (0)) 
and experimental B factors as a function of their corresponding eigenfrequencies. The points refer to all eigenmodes of the 
X-ray structures listed in the caption of Fig. ^(b) Histograms from data in (a). Positive correlations come from slow modes, 
negative from fast modes. 



but from a suitable parametric fit of the measured diffraction pattern. In unfavorable conditions (affected also by 
crystallographic resolution), the fit may be significantly underdetermined and reliable data can be obtained only 
imposing additional constraints on the B factors (such as their smooth variation along the main- or side-chains). 
Inspection of the PDB files for which the worst correlations are seen have revealed that, indeed, those proteins 
lacked the S-factor smoothness. Another possible source of discrepancy between theoretical and experimental data 
is the fact that we neglected the proximity effect of the surrounding proteins in the crystal structure. The presence 
of neighboring proteins might reduce the mobility of some regions (and notably those experiencing a pronounced 
freedom of movement), as already noted in previous studies |5^ . 

Equation (^) clearly indicates that most of the contribution to the theoretical B factor comes from low-frequency 
modes, since the fast ones are suppressed by their frequency reciprocal weight. It is sensible to ask how much each 
individual mode correlates with the experimental B factors. If a clear trend of such correlation against the mode 
frequency is found, one could devise a better (knowledge-based) weighting scheme, alternative to the one of eqn. ^ to 
improve the theoretical estimates of mean square displacement of residues. To answer this question we have obtained 
for each mode, k, the linear correlations between the experimental B factors along the chain, {B^^^} and the mode 
amplitudes, P}- Such correlations, computed separately for each mode of the proteins listed in Fig. || are reported 
in Fig. 

The majority of positive correlations are found in the region of slowest eigenmodes (eigenfrequencies up to 6), while 
there is a negligible correlation for middle frequencies, and a weak anti-correlation for fast eigenmodes (frequencies 
larger than 12). This is evident in figure ^ where one can compare the histograms of the correlation coefficients 
cumulated (with equal weights) over fast and slow eigenmodes. The anti-correlation of the experimental i3-factors with 
the square-amplitudes profile of fast eigenmodes is in agreement with the graph theoretical prediction of a localization 
of the fastest eigenmodes in nodes of higher connectivity. These nodes are less mobile in the protein structure and 
hence associated with smaller B values. Furthermore, since all eigenmodes are orthogonal, the sites whose vibrational 
amplitude is large in the slowest eigenmodes, will not be very mobile in the fastest ones, and viceversa. 

Following this analysis we can conclude that low eigenmodes cooperate to generate much of the fluctuation pattern 
of the protein structure, while intermediate modes have little influence and fast modes could be safely excluded from 
the sum, since they contribute to raise the mobility of highly connected nodes. 

Finally, we observe that the theoretical B's carry a non trivial dependence on temperature through the eigen- 
modes (due to the temperature dependence of matrix L). Fig. ^ summarizes the behaviour of the correlation with 
experimental data against temperature for four representative proteins. It can be seen that, typically, by working 
at a temperature around 2 Tp the correlation with experimental data can be increased significantly. As visible, an 
exception to this trend is given by the few proteins which already have a poor correlation at T = 0. 
We have considered also the case of NMR proteins, for which accurate relaxation measurements can directly probe 
the mobility of local portions of the protein thus providing an effective B factor. Strikingly, the correlation between 
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FIG. 8: Correlation between theoretical and experimental B factors as a function of temperature. 

the theoretical and the experimental B factors observed in this situation was systematically higher than for the X- 
ray resolved counterparts, as can be ascertained from Fig. |[ It would be tempting to conclude that this improved 
agreement is due to the more direct way with which NMR experiments can address local vibrational amplitudes. 
However, due to the insufficient annotation of PDB files of some NMR-resolved structures, it is not always possible 
to ascertain whether the reported data follow from precise relaxation measurements or from an a posteriori analysis 
of the various model structures compatible with measured distance-constraints. 

B factors and key sites 



The analysis of the mobility, in thermal equilibrium, of different protein residues carries significant information 
about sites that are important in the folding process. We illustrate this important application for the case of the 
HIV-1 Protease. This enzyme represents an ideal benchmark due to the vast number of clinical studies thanks to 
which comprehensive tables of key mutating sites have been compiled ^ ^ Molecular dynamics 

studies have attempted to clarify the special role played by these amino acids either through all-atom simulations 
of the protease or through topological folding models |^ . It appears that a number of key mutating sites form 
contacts that act as rate limiting steps for the folding process 

One attempt to describe/predict this crucial set of amino acids from the analysis of normal modes was carried out 
in Refs. |2^, Q within GNM (to which our model reduces when T = 0), who pointed out that high frequency 
modes are localized close to key mutating sites Our results, especially those of Fig. |^, support the view that the 
sites where the high frequency modes concentrate are paramount for native stability; in fact they are among the least 
exposed ones. However, this property is only related to the native structure, and does not imply that the same sites 
play the leading role in overcoming the folding rate-limiting steps. 

A more direct strategy would be to determine the handful of contacts that form cooperatively at, or in the neigh- 
borhood of, the folding transition temperature. The formation of such contacts, which has an all-or-none character, 
is expected to constrain significantly the mobility of the residues involved in their establishment. This scheme, which 
has been confirmed a posteriori in simplified folding models of the HIV-1 PR and prion j^, can be naturally 
adopted in the present case. In fact, a natural and convenient measure of the degree of spatial constraint of a given 
site is provided by the associated B-factor. Thus, near the folding transition temperature, one expects that the key 
sites have nearly native like values for the B factors, while other residues will have much larger fluctuations than in 
the native state. Therefore, one may identify the key sites as those with small values of B or small derivatives of B 
with respect to temperature. It turns out that both these criteria work well, as illustrated in Fig. |9[ Among the top 
13 sites ranked according to the smallness of B factors there are 5 key residues: 32,33,77,84,30. In the top 13 sites 
ranked according to the derivative of B there is an additional key site, namely residue 82. The probability to observe 
at least these many matches had we chosen the sites randomly, would have been 3 % and 0.5 % respectively in the two 
cases. It must be stressed that the present strategy to identify key sites exploits the information on native topology 
in a non-trivial manner avoiding, for example, the pitfall of selecting the most buried sites as the key ones. In fact. 
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FIG. 9: Scatter plot of B-factors and their temperature derivative for the sites of HIV-1 PR monomer. The key sites known to 
cause resistance to protease-inhibiting drugs are highlighted. 



to collect the same number of correct matches found before, one needs to retain about twice as many burial-ranked 
sites than for the B-factor-ranked case. We have verified that this conclusion holds for all interaction cutoffs in the 
range 6.5-8 A. This very good statistical validation shows that the physically appealing identification of the key sites 
through normal modes analysis can be fruitfully applied in contexts of high practical importance. 



CONCLUSIONS 



We have focused on the characterization of both the folding process and equilibrium physical properties of natural 
proteins through the normal modes analysis of a suitable topological model. Previous investigations of normal vibra- 
tions around the native state have proved a valuable tool to describe the large scale motion of proteins and obtain 
clues about their biological functionality. 

Although we also focus on specific protein instances, such as the HIV-1 protease, our scope is more directed at finding 
some general characteristics of proteins' dynamics in thermal equilibrium (at various temperatures) that distinguish 
these biopolymers from a randomly compact polymer. This comparison has a two-fold objective: one of theoretical 
interest, the other more practical. 

/^From the point of theoretical biophysics, the identification of features that are unique (and common) to proteins 
provides clues about the special evolutionary criteria that have promoted the selection of the wide repertoire of 
naturally-occurring proteins yet using only a limited number of fold families ^Tj. In the detailed account 

provided here we have shown that proteins are very flexible and have long relaxation times, in thermal equilibrium, 
compared to globular polymers. In turn, the relaxation time of partially folded states is shown to have a significant 
correlation with experimental folding times (r can be as high as 0.73). We give a quantitative accoimt of the intimate 
dependence of the relaxation rates with native state topology; thereby providing a novel additional support to the 
influence of native-state topology over folding rate. It is also found that high-frequency modes involve as many 
sites in natural proteins as in disordered structures. Despite this, the structural regions involved in slow motions 
are much more extended in proteins. Without this property (common to the few tens of proteins considered here) 
it would probably be impossible for a protein to carry out the articulated mechanical tasks necessary for biological 
functionality The direct investigation of the vibrational properties of individual secondary motifs 

has shown that their presence and organization are distinctive features of protein structures. This observation adds 
to previous evidence according to which the ubiquity of secondary motifs in real proteins is due to the very special 
properties that they confer both to the native state and to the folding process. 

^From a more practical point of view, the knowledge of how these " special features" have arisen through the selection 
of certain viable native shapes, can be used as a predictive tool. In particular, the straightforward deterministic 
analysis of the topologic folding model adopted here allows to predict folding rates with high statistical significance. 
A further important application is in the " a priori" determination of the B factors that are essential in the refinement 
and validation of protein structures resolved by X-ray and NMR. A by-product of the analysis of B factors of partially 
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folded states is the identification of the sites (or contacts) that lock into their relative native position at early stages of 
the folding process. Taking the case of HIV-1 PR as a reference, we have shown that such sites are among those that 
are known to be involved in rate-limiting steps of the folding process. Therefore, within this single simple framework 
it is possible to model various aspects of the folding process and other equilibrium properties obtaining a thorough 
protein characterization that is consistent with available experimental results. 

This work was supported by INFM and MURST cofin 2001. GL's research has been supported by a Marie Curie 
Fellowship of the European Community Programme "Improving Human Research Potential and the Socio-Economic 
Knowledge Base" under contract number HPMF~CT~2001-01432. The authors are solely responsible for information 
communicated and the European Commission is not responsible for any view or results expressed. We are indebted 
to Jayanth Banavar, Ken Dill, Burak Erman, Doriano Lamba and Henriette Molinari for stimulating discussions and 
helpful suggestions. 

APPENDIX. METHODS: ANALYTICAL CHARACTERIZATION OF THE MODEL 

The simplified energy functional used to characterize the equilibrium or folding properties of a target structure, F 

is: 

N — l 

^ = ^ E ^'+0^ + 1 E ^^K^^ R"\p^^ (4) 

i=l i=jLj 

where denotes the distance-vector of the ith Cq, from its position in the native structure, F (assumed to be 
N residues long). K is the strength of the peptide bonds, and T is the absolute temperature (incorporating the 
Boltzmann constant). A'" is the contact matrix, whose element A^ is 1 if residues i and j are in contact in F (i.e. 
their Cq separation is below a cutoff c. (in this study c — 7.5 A) and otherwise. The matrix A?^ along with the 
native Cq, coordinates encodes the topology of the protein. In standard off-lattice approaches, the interaction V{d) 
between non-bonded amino acids at a distance d, is taken to be a square well potential, or some type of Lennard- Jones 
interaction. Our choice in Eq. (^) is a sort of "harmonic well" which, while being physically sound and viable, is 
suitable for a self-consistent treatment, as explained below. 

The temperature-dependent term pij is the probability that the separation of amino acids i and j fall within a 
distance R of the native separation. Therefore R plays the role of an effective outer rim of the quadratic potential 
well and can be set to a few Angstroms (i? = 3 A in the present study) to reflect the fact that, when the separation 
of two residues exceeds substantially the native one their interaction is negligible. The strength of the peptide bond, 
is set to 1/15 jlj]. Such choice of values guarantees that close to the folding transition temperature, nearly half 
of the native contacts are formed, consistently with several unrelated studies ^ . 

The simple quadratic form of Hamiltonian in Eq. (^) allows to determine exactly the Py's from the self-consistent 
relation: 

p., = (e(|r,-r,f 

where Q{x) is the unitary step function and the brackets denote the thermal averaging under the action of Hamiltonian 
^, that depends on p,y as well. Operatively, one starts from a trial choice of the pij , which is used to determine a new 
set of parameters through eqn. (|^). Convergence is obtained in a few tens of iterations at any temperature. Now in 
such self consistent approach the problem is fully solved and all equilibrium or dynamical quantities can be calculated 
exactly by evaluating analytical expressions. At T = the model correctly assigns the lowest energy to the native 
structure (all Py 's equal to 1). Upon increasing the temperature, each interacting pair will have increasing mutual 
separation (in modulus) and correspondingly, the 's will decrease to reflect the milder binding. In the limit T — > oo 
all Py 's tend to 0. It is worth pointing out that our energy functional treats non-native contacts in a neutral way: 
their formation is not favoured but is not discouraged either. As a consequence, at finite temperatures, non-native 
contacts have a non-zero probability of formation, which can still be computed through ||. 

Simple algebraic operations allow to recast the self-consistent energy function in the following form: 

H = ^riL,,rj-l^Ai;.i?2p^^. (6) 



13 



where 



K-T{2-5^,^-6i,N)/2 + Y.i^\iPi,i fori = j 
^'■^ \ -Pi,,Ai,,+ir-T (-<5i,,+i-<5„_i)/2 forz^j. ^'^ 

While the present form of the model does not accurately describe the effects of self-avoidance this does not lead to 
a qualitatively wrong behaviour in the highly-denatured ensemble (large T). The treatment of steric effects becomes 
progressively more accurate as temperature is lowered. In fact, the model guarantees that the native state is the 
true ground state and therefore protein conformations found at low temperature inherit the native self-avoidance. 
The connectedness of the chain, as well as its entropy, are captured in a simple but non- trivial manner. The most 
significant advantage of the model is that it can be used to explore the equilibrium thermodynamics without being 
hampered by inaccurate or sluggish dynamics. 
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